
[Snippets][CPU] Added external repacking via BrgemmCopyB #28179

Open · a-sidorova wants to merge 7 commits into master from feature/snippets/external_repacking
Conversation


@a-sidorova a-sidorova commented Dec 23, 2024

Details:

  • Added two separate implementations for external repacking: one in the same parallel section as the kernel, and one in a separate parallel section before kernel execution (see the sketch below)
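A minimal sketch of the two strategies, with hypothetical stand-in names (the real implementation lives in the CPU plugin's SubgraphExecutor and uses OpenVINO's parallel primitives):

```cpp
#include <cstddef>
#include <functional>

// Hypothetical stand-ins for the real repacking (BrgemmCopyB) and kernel calls.
using Repack = std::function<void(size_t ithr)>;
using Kernel = std::function<void(size_t ithr)>;

// Strategy 1: repack inside the same parallel section as the kernel.
// Each thread repacks only its own slice right before running the kernel.
void in_parallel_repacking(size_t nthreads, const Repack& repack, const Kernel& kernel) {
    for (size_t ithr = 0; ithr < nthreads; ++ithr) {  // stands in for a parallel loop
        repack(ithr);
        kernel(ithr);
    }
}

// Strategy 2: a separate parallel section repacks everything up front,
// then a second parallel section executes the kernel.
void separate_repacking(size_t nthreads, const Repack& repack, const Kernel& kernel) {
    for (size_t ithr = 0; ithr < nthreads; ++ithr)  // first pass: repack all inputs
        repack(ithr);
    for (size_t ithr = 0; ithr < nthreads; ++ithr)  // second pass: run the kernel
        kernel(ithr);
}
```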

Tickets:

  • 159886

TODO:

  • Adjust the heuristic for choosing the implementation
  • Add layout support

@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Dec 23, 2024
@a-sidorova a-sidorova force-pushed the feature/snippets/external_repacking branch 4 times, most recently from c6c62b5 to 9e85ea1 on December 23, 2024 11:19
@a-sidorova a-sidorova force-pushed the feature/snippets/external_repacking branch from 9e85ea1 to 2b536b7 on December 23, 2024 11:24
@a-sidorova a-sidorova force-pushed the feature/snippets/external_repacking branch from f2523ef to b665659 on December 23, 2024 12:05
@a-sidorova a-sidorova added this to the 2025.0 milestone Dec 23, 2024
@a-sidorova a-sidorova marked this pull request as ready for review December 23, 2024 13:29
@a-sidorova a-sidorova requested review from a team as code owners December 23, 2024 13:29

@IvanNovoselov IvanNovoselov left a comment


The first part of the review.

@@ -172,10 +173,48 @@ class Subgraph::SubgraphExecutor {
inline void segfault_detector();
#endif

private:
std::vector<MemoryPtr> reorder_inputs(const dnnl::stream& strm, const std::vector<MemoryPtr>& inMemPtrs);
#ifdef OPENVINO_ARCH_X86_64
Reviewer comment:

Long #ifdef blocks significantly reduce readability. Do you think we can create different executors for x86-64 and ARM?
Note that we can use variadic templates + perfect forwarding to avoid repeating these long argument lists in constructors (see the sketch below).
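A minimal sketch of that suggestion, assuming a hypothetical base executor (the constructor signature here is illustrative, not the PR's actual one):

```cpp
#include <string>
#include <utility>

// Hypothetical base executor with a long constructor argument list.
class SubgraphBaseExecutor {
public:
    SubgraphBaseExecutor(int config, double scale, const std::string& name) {}
    virtual ~SubgraphBaseExecutor() = default;
};

// An arch-specific executor forwards whatever the base needs, so the long
// argument list is written only once, in the base class.
class X64Executor : public SubgraphBaseExecutor {
public:
    template <typename... Args>
    explicit X64Executor(Args&&... args)
        : SubgraphBaseExecutor(std::forward<Args>(args)...) {}
};

// Usage: X64Executor exec(1, 0.5, "mha_subgraph");
```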

Author reply:

I thought about separate executors during development.
Firstly, I believe the current implementation is a temporary solution and nothing arch-specific should stay in the executors. So if we implement separate executors, after some time (probably not soon) we would have to unite them again.
Secondly, we already have two separate executors for dynamic and static shapes. If we also split executors by architecture, would there be 4 executors (2 architectures x 2 shape types)?

I'd say this is an open question, and I'm glad to discuss it offline 😃

Comment on lines +100 to +101
if (should_repacking_be_in_parallel())
in_parallel_repack_inputs(inMemPtrs, indexes, ithr, call_args);
Reviewer comment:

As far as I understand, RepackingImplType can't change after the executor is created, so there is no need to check it inside the caller. To minimize runtime overheads, we should create different callers for different RepackingImplType values.
Moreover, since we already have a builder, I would suggest (or at least consider) exposing RepackingImplType as the executor's template parameter. We could then use if constexpr (...) to take advantage of compile-time-resolved branches and enjoy zero runtime overhead.
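A minimal sketch of the if constexpr idea, with hypothetical names mirroring the PR's RepackingImplType:

```cpp
#include <cstddef>

enum class RepackingImplType { NONE, IN_PARALLEL, SEPARATE };

template <RepackingImplType impl>
struct Caller {
    void operator()(size_t ithr) const {
        // Resolved at compile time: instantiations for NONE or SEPARATE
        // contain no branch at all, so the hot path pays zero overhead.
        if constexpr (impl == RepackingImplType::IN_PARALLEL)
            in_parallel_repack(ithr);
        run_kernel(ithr);
    }
    void in_parallel_repack(size_t) const { /* repack this thread's slice */ }
    void run_kernel(size_t) const { /* execute the snippets kernel */ }
};

// The builder would pick the instantiation once at creation time, e.g.:
// Caller<RepackingImplType::IN_PARALLEL> caller;
```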


uint8_t* repacked_ptr = get_external_scratchpad_ptr(ithr, in_idx) + dst_offset;

auto& offsets = m_repacked_offsets_by_threads.at(ithr)[in_idx];
Reviewer comment:

Since we use ithr for indexing, why do we need an unordered map here? It looks like a vector would be just as good, but faster (see the sketch below).
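A minimal sketch of the suggested swap, with illustrative types (thread ids are dense in [0, nthreads), so plain indexing replaces hashing):

```cpp
#include <cstddef>
#include <vector>

using OffsetsPerInput = std::vector<size_t>;  // illustrative payload type

// Before: std::unordered_map<size_t, std::vector<OffsetsPerInput>>, i.e.
// a hash lookup on every access, even though keys are just thread ids.

// After: a vector sized once at executor creation, indexed directly by ithr.
std::vector<std::vector<OffsetsPerInput>> repacked_offsets_by_threads;

void init_offsets(size_t nthreads, size_t num_inputs) {
    repacked_offsets_by_threads.assign(nthreads, std::vector<OffsetsPerInput>(num_inputs));
}
// Access: auto& offsets = repacked_offsets_by_threads[ithr][in_idx];
```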

Author reply:

Makes sense. Thanks!

Comment on lines +1076 to +1079
for (size_t j = 0; j < indexes.size(); j++) {
src_offset += src_offsets[j] * indexes[j];
dst_offset += dst_offsets[j] * indexes[j];
}
Reviewer comment:

This loop can't be vectorized because indexes.size() is not known at compile time. That's a pity, because we know that indexes.size() == 6 in 96% of cases. Of course it's hardly a bottleneck, but it's quite easy to allow for this optimization: templates again 🙂, with the indexes size as a template parameter (see the sketch below).
It also makes sense because input ranks can't change at runtime, so there will likely be only one instantiation of this template.
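A minimal sketch of the templated variant (names are illustrative; the real member layout differs):

```cpp
#include <array>
#include <cstddef>

// Rank is a compile-time constant, so the compiler can fully unroll
// (and potentially vectorize) the accumulation.
template <size_t Rank>
void accumulate_offsets(const std::array<size_t, Rank>& indexes,
                        const std::array<size_t, Rank>& src_offsets,
                        const std::array<size_t, Rank>& dst_offsets,
                        size_t& src_offset, size_t& dst_offset) {
    for (size_t j = 0; j < Rank; ++j) {  // trip count known at compile time
        src_offset += src_offsets[j] * indexes[j];
        dst_offset += dst_offsets[j] * indexes[j];
    }
}
// Since the input rank is fixed for a given model, typically only one
// instantiation (e.g. accumulate_offsets<6>) ends up in the binary.
```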

uint8_t* repacked_ptr = get_external_scratchpad_ptr(ithr, in_idx) + dst_offset;

auto& offsets = m_repacked_offsets_by_threads.at(ithr)[in_idx];
if (offsets.count(src_offset) == 0) {
Reviewer comment:

A similar concern: although unordered_map access is O(1), the constant factor is rather large.
It would be much faster to create a vector<bool> is_repacked (which is implemented as a bitset, so it takes very little space). The only question is whether we can come up with a sufficiently dense hash function to avoid wasting too much space.
It looks like indexes[0] * indexes[1] * ... * indexes.back() is good enough, especially considering that this type of executor is used only when the parallel work amount is small (see the sketch below).
What do you think?
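A minimal sketch of the proposed bitset tracker, using the product hash from the comment (note the author's objection below: the product collapses to 0 whenever any index is 0):

```cpp
#include <cstddef>
#include <vector>

struct RepackTracker {
    std::vector<bool> is_repacked;  // bit-packed, so very compact

    explicit RepackTracker(size_t max_hash) : is_repacked(max_hash + 1, false) {}

    // The hash proposed in the comment; it is NOT collision-free:
    // e.g. [0,0,...,0] and [0,1,...,0] both hash to 0.
    static size_t hash(const std::vector<size_t>& indexes) {
        size_t h = 1;
        for (size_t idx : indexes)
            h *= idx;
        return h;
    }

    // Returns true if the caller should repack now (first visit).
    bool mark(const std::vector<size_t>& indexes) {
        const size_t h = hash(indexes);
        if (is_repacked[h])
            return false;
        is_repacked[h] = true;
        return true;
    }
};
```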

Author reply:

Please take a look at my explanation on your last comment; I believe you will get my point after reading it 😊

As for indexes[0] * indexes[1] * ... * indexes.back(): I think we cannot use it, because the product for both [0,0,...,0] and [0,1,...,0] will be equal to 0, even though we probably need to repack the data for the second set of indexes too.


uint8_t* repacked_ptr = get_external_scratchpad_ptr(ithr, in_idx) + dst_offset;

auto& offsets = m_repacked_offsets_by_threads.at(ithr)[in_idx];
Reviewer comment:

Another question: why do we need to check every repacked input here via in_idx?
I thought that either no inputs for this thread are repacked (it's the first time this thread processes this batch), or all inputs are repacked (this thread already processed some data from this batch).
Can it be that the thread processed only a part of m_repacked_inputs?

Author reply:

> Can it be that the thread processed only a part of m_repacked_inputs?

If I understood your question correctly: yes, it can.

For example: we have a repacked input shape [1, 1, 5, 1, 1, 1] (after splitting of dimension M), parallel_domain = [1, 1, 5, 16, 1, 1], work_amount = 5 x 16 = 80 and nthreads = 16. The splitter function returns the following interval for the 5th thread:
start = 15 and end = 18
It means that this thread will get these indexes:

  • for i=15, indexes [0, 0, 0, 15, 0, 0]: repacks the data for the first batch of the repacked input
  • for i=16, indexes [0, 0, 1, 0, 0, 0]: moves to the next batch of the repacked input, so we need to repack one more time, but on new data
  • for i=17, indexes [0, 0, 1, 1, 0, 0]: the thread stays on the second batch of the repacked input, so there is no need to repack this batch again

Moreover, there might be a case (though I think it can occur only in tests, given how MHA is used in real applications) when the repacked input shape is [1, 1, 1, 4, 1, 1], the parallel domain is [1, 1, 10, 4, 1, 1], and the thread count is 4. In this case the first thread will process data with start = 0, end = 9 and the following indexes:

  • for i=0, indexes [0, 0, 0, 0, 0, 0]: repacks the first batch of the repacked data
  • for i=3, indexes [0, 0, 0, 3, 0, 0]: repacks the last batch of the repacked data
  • for i=4, indexes [0, 0, 1, 0, 0, 0]: we would need to repack the first batch again due to broadcasting against the parallel domain. Since we keep the src offsets in the std::set, offsets.count(src_offset) shows that we already repacked this data, so there is no need to call BrgemmCopyB one more time.

However, I think it is a really rare case (or impossible in practice) for the batches of SDPA inputs to differ. So I think we can store only one size_t value per input, just to check whether we repacked this data on the previous iteration (see the sketch below).

If you think we need to keep std::set to cover the case described above, let me know and I will bring it back.
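A minimal sketch of that single-value alternative (names are illustrative, not from the PR): remember only the last repacked src offset per (thread, input) and repack when it changes. The rare revisit pattern described above would then trigger one extra repack, which is exactly the trade-off being discussed.

```cpp
#include <cstddef>
#include <vector>

constexpr size_t NO_OFFSET = static_cast<size_t>(-1);

struct LastOffsetTracker {
    // last_offset[ithr][in_idx] = src offset repacked on the previous iteration
    std::vector<std::vector<size_t>> last_offset;

    LastOffsetTracker(size_t nthreads, size_t num_inputs)
        : last_offset(nthreads, std::vector<size_t>(num_inputs, NO_OFFSET)) {}

    // Returns true if the data at src_offset still needs repacking.
    bool needs_repack(size_t ithr, size_t in_idx, size_t src_offset) {
        if (last_offset[ithr][in_idx] == src_offset)
            return false;  // same batch as on the previous iteration
        last_offset[ithr][in_idx] = src_offset;
        return true;
    }
};
```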

Thanks!
